Counterfactual Evaluation
Counterfactual Evaluation of Peer-Review Assignment Policies
Peer-review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm translate into changes in review quality. In this work, we leverage recently proposed policies that introduce randomness into peer-review assignment, in order to mitigate fraud, as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit the fact that such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions on the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality of changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the reviewer's past papers and the submission), and (ii) the cost of randomization, capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing a higher weight on text similarity results in higher review quality and that introducing randomization into the reviewer-paper assignment only marginally reduces review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems.
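The abstract builds on standard off-policy evaluation; a minimal sketch of the inverse-propensity-scoring (IPS) estimator it starts from, assuming a hypothetical data layout of logged reviewer-paper pairs. The paper's actual contribution (partial identification bounds when logging probabilities can be zero) is not implemented here.

```python
import numpy as np

def ips_policy_value(quality, logging_prob, target_prob):
    """IPS estimate of mean review quality under a counterfactual
    assignment policy, from reviews logged under a randomized policy.

    quality      : observed quality of each logged reviewer-paper pair
    logging_prob : probability the randomized logging policy assigned the pair
    target_prob  : probability the counterfactual policy assigns the pair
    """
    quality = np.asarray(quality, dtype=float)
    weights = np.asarray(target_prob, dtype=float) / np.asarray(logging_prob, dtype=float)
    return float(np.mean(weights * quality))

# Example: three logged pairs with quality scores in [0, 1].
est = ips_policy_value(
    quality=[0.8, 0.5, 0.9],
    logging_prob=[0.5, 0.5, 0.25],  # randomized policy's assignment probabilities
    target_prob=[1.0, 0.0, 0.5],    # counterfactual policy's probabilities
)
```

Note that a pair with `logging_prob == 0` makes the weight undefined; such positivity violations are exactly the gap the paper's monotonicity and Lipschitz assumptions are introduced to bound.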
Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking
Zhang, Qing, Deng, Alex, Du, Michelle, Gao, Huiji, He, Liwei, Katariya, Sanjeev
Evaluation plays a crucial role in the development of ranking algorithms for search and recommender systems. It enables online platforms to create user-friendly features that drive commercial success in a steady and effective manner. The online environment is particularly conducive to applying causal inference techniques, such as randomized controlled experiments (known as A/B tests), which are often more challenging to implement in fields like medicine and public policy. However, businesses face unique challenges when it comes to effective A/B testing. Specifically, achieving sufficient statistical power for conversion-based metrics can be time-consuming, especially for significant purchases like booking accommodations. While offline evaluations are quicker and more cost-effective, they often lack accuracy and are inadequate for selecting candidates for A/B tests. To address these challenges, we developed interleaving and counterfactual evaluation methods to facilitate rapid online assessments that identify the most promising candidates for A/B tests. Our approach not only increased the sensitivity of experiments by a factor of up to 100 (depending on the approach and metrics) compared to traditional A/B testing but also streamlined the experimental process. The practical insights gained from usage in production can also benefit organizations with similar interests.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > Ontario > Toronto (0.05)
- North America > United States > Washington > King County > Seattle (0.04)
- (2 more...)
- Research Report > Strength High (1.00)
- Research Report > Experimental Study (1.00)
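The Airbnb abstract above does not specify which interleaving variant was used; a common choice is team-draft interleaving, sketched here under that assumption with hypothetical function and variable names.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k):
    """Simplified team-draft interleaving: build a combined ranking of
    length k by letting rankers A and B alternately draft their
    highest-ranked item not yet shown, with the draft order within
    each round randomized. Clicks credit the ranker that drafted
    the clicked item, so far fewer sessions are needed to separate
    two rankers than in a traditional A/B test."""
    interleaved, team = [], {}
    ia = ib = 0
    while len(interleaved) < k and (ia < len(ranking_a) or ib < len(ranking_b)):
        picks = ["a", "b"]
        random.shuffle(picks)  # randomize which team drafts first this round
        for side in picks:
            ranking, idx = (ranking_a, ia) if side == "a" else (ranking_b, ib)
            # skip items the other team already drafted
            while idx < len(ranking) and ranking[idx] in team:
                idx += 1
            if idx < len(ranking) and len(interleaved) < k:
                item = ranking[idx]
                interleaved.append(item)
                team[item] = side
                idx += 1
            if side == "a":
                ia = idx
            else:
                ib = idx
    return interleaved, team
```

At serving time the platform shows `interleaved` and attributes each click via `team[item]`; if A's items collect more clicks across sessions, A wins the comparison.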
Counterfactual Evaluation of Ads Ranking Models through Domain Adaptation
Radwan, Mohamed A., Bhattacharjee, Himaghna, Lanners, Quinn, Zhang, Jiasheng, Karakulak, Serkan, Nassif, Houssam, Bayir, Murat Ali
We propose a domain-adapted reward model that works alongside an Offline A/B testing system for evaluating ranking models. This approach effectively measures reward for ranking model changes in large-scale Ads recommender systems, where model-free methods like IPS are not feasible. Our experiments demonstrate that the proposed technique outperforms both the vanilla IPS method and approaches using non-generalized reward models.
- Europe > Italy > Apulia > Bari (0.06)
- North America > United States > New York > New York County > New York City (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Asia > Middle East > Israel (0.04)
- North America > United States > Washington > King County > Seattle (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > Canada > British Columbia (0.04)
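The Ads-ranking abstract above gives no implementation details; what follows is a minimal sketch of the generic reward-model (direct-method) approach it builds on: score the ads a candidate ranker would have shown with a learned reward model rather than reweighting logged rewards. The `candidate_ranker` and `reward_model.predict` interfaces are hypothetical, and the paper's domain adaptation of the reward model is not shown.

```python
def direct_method_value(logged_requests, candidate_ranker, reward_model, k=5):
    """Direct-method offline estimate of a candidate ranker's value:
    re-rank each logged request with the candidate ranker, then sum the
    reward model's predicted reward over the top-k ads it would have
    shown. No propensities are needed (useful when IPS is infeasible),
    but the estimate inherits any bias in the reward model; that bias
    is what domain adaptation aims to reduce."""
    total = 0.0
    for request in logged_requests:
        ranked_ads = candidate_ranker(request)[:k]
        total += sum(reward_model.predict(request, ad) for ad in ranked_ads)
    return total / len(logged_requests)
```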
Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions
McInerney, James, Brost, Brian, Chandar, Praveen, Mehrotra, Rishabh, Carterette, Ben
Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner. Providing and evaluating good sequences of recommendations is therefore a central problem for these services. Prior reweighting-based counterfactual evaluation methods either suffer from high variance or make strong independence assumptions about rewards. We propose a new counterfactual estimator that allows for sequential interactions in the rewards with lower variance in an asymptotically unbiased manner. Our method uses graphical assumptions about the causal relationships of the slate to reweight the rewards in the logging policy in a way that approximates the expected sum of rewards under the target policy. Extensive experiments in simulation and on a live recommender system show that our approach outperforms existing methods in terms of bias and data efficiency for the sequential track recommendations problem.

Offline evaluation is challenging because the deployed recommender decides which items the user sees, introducing significant exposure bias in logged data [7, 16, 22]. Various methods have been proposed to mitigate bias using counterfactual evaluation. In this paper, we use terminology from the multi-armed bandit framework to discuss these methods: the recommender performs an action by showing an item depending on the observed context (e.g., user covariates, item covariates, time of day, day of the week) and then observes a reward through the user response (e.g., a stream, a purchase, or length of consumption) [14]. The recommender follows a policy, a distribution over actions, by drawing items stochastically conditioned on the context. The basic idea of counterfactual evaluation is to estimate how a new policy would have performed if it had been deployed instead of the deployed policy.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > Santa Clara County > Los Gatos (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
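To make the trade-off in the slate abstract concrete, here is a sketch of the two standard reweighting baselines it improves on: full-slate IPS (unbiased but high variance) and item-level IPS (low variance but assumes rewards are independent across slate positions). Variable names are hypothetical, and the paper's own estimator, which relaxes the independence assumption via graphical assumptions, is not shown.

```python
import numpy as np

def slate_ips(slate_reward, logging_slate_prob, target_slate_prob):
    """Full-slate IPS: weight each logged slate's total reward by the
    probability ratio of the *entire* slate. Unbiased, but slate
    probabilities are products over positions, so the ratios (and
    hence the variance) blow up quickly."""
    w = np.asarray(target_slate_prob, float) / np.asarray(logging_slate_prob, float)
    return float(np.mean(w * np.asarray(slate_reward, float)))

def item_ips(item_rewards, logging_item_prob, target_item_prob):
    """Item-level IPS: weight each position's reward by that item's own
    probability ratio, then sum over positions. Much lower variance,
    but only unbiased if the reward at each position is independent
    of the rest of the slate.

    item_rewards, *_item_prob : arrays of shape (n_slates, slate_len)
    """
    w = np.asarray(target_item_prob, float) / np.asarray(logging_item_prob, float)
    return float(np.mean((w * np.asarray(item_rewards, float)).sum(axis=1)))
```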
Counterfactual Evaluation of Treatment Assignment Functions with Networked Observational Data
Guo, Ruocheng, Li, Jundong, Liu, Huan
Counterfactual evaluation of novel treatment assignment functions (e.g., advertising algorithms and recommender systems) is one of the most crucial causal inference problems for practitioners. Traditionally, randomized controlled trials (A/B tests) are performed to evaluate treatment assignment functions. However, such trials can be time-consuming, expensive, and even unethical in some cases. Offline counterfactual evaluation of treatment assignment functions therefore becomes a pressing issue, because a massive amount of observational data is available in today's big data era. Counterfactual evaluation requires handling hidden confounders: the unmeasured features that causally influence both the treatment assignment and the outcome. Most of the existing methods sidestep hidden confounders by assuming there are none. However, this assumption can be untenable in the context of massive observational data. When such data comes with network information, the latter can potentially be useful for correcting hidden confounding bias. As such, we first formulate a novel problem: counterfactual evaluation of treatment assignment functions with networked observational data. Then, we investigate the following research questions: How can we utilize network information in counterfactual evaluation? Can network information improve the estimates in counterfactual evaluation? Toward answering these questions, we first propose a novel framework, Counterfactual Network Evaluator (CONE), which (1) learns partial representations of latent confounders under the supervision of observed treatments and outcomes; and (2) combines them for counterfactual evaluation. Then, through extensive experiments, we corroborate the effectiveness of CONE. The results imply that incorporating network information mitigates hidden confounding bias in counterfactual evaluation.
Counterfactual Risk Assessments, Evaluation, and Fairness
Coston, Amanda, Chouldechova, Alexandra, Kennedy, Edward H.
Algorithmic risk assessments are increasingly used to help humans make decisions in high-stakes settings, such as medicine, criminal justice, and education. In each of these cases, the purpose of the risk assessment tool is to inform actions, such as medical treatments or release conditions, often with the aim of reducing the likelihood of an adverse event such as hospital readmission or recidivism. Problematically, most tools are trained and evaluated on historical data in which the observed outcomes depend on the historical decision-making policy. These tools thus reflect risk under the historical policy, rather than under the different decision options that the tool is intended to inform. Even when tools are constructed to predict risk under a specific decision, they are often improperly evaluated as predictors of the target outcome. Focusing on the evaluation task, in this paper we define counterfactual analogues of common predictive performance and algorithmic fairness metrics that we argue are better suited for the decision-making context. We introduce a new method for estimating the proposed metrics using doubly robust estimation. We provide theoretical results showing that only under strong conditions can fairness according to the standard metric and the counterfactual metric simultaneously hold. Consequently, fairness-promoting methods that target parity in a standard fairness metric may (and, as we show empirically, do) induce greater imbalance in the counterfactual analogue. We provide empirical comparisons on both synthetic data and a real-world child welfare dataset to demonstrate how the proposed method improves upon standard practice.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > Oregon (0.04)
- North America > United States > Ohio (0.04)
- (3 more...)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.93)
- Law (1.00)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.82)
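The abstract above mentions doubly robust estimation; here is a minimal sketch of the generic doubly robust (AIPW) estimator of a counterfactual mean outcome under the baseline decision, assuming a binary decision and no unmeasured confounding. Argument names are hypothetical, and the paper's specific counterfactual performance and fairness metrics are not reproduced.

```python
import numpy as np

def dr_counterfactual_mean(y, a, prop, mu0):
    """Doubly robust (AIPW) estimate of E[Y^{a=0}]: the mean outcome had
    everyone received the baseline decision a=0 (e.g., release rather
    than detention). Consistent if *either* the propensity model or
    the outcome model is correctly specified.

    y    : observed outcomes
    a    : observed binary decisions (1 = treated/intervened)
    prop : estimated propensity scores P(A=1 | X)
    mu0  : estimated baseline outcomes E[Y | A=0, X]
    """
    y, a = np.asarray(y, float), np.asarray(a, float)
    prop, mu0 = np.asarray(prop, float), np.asarray(mu0, float)
    # Outcome-model prediction, corrected by reweighted residuals
    # from the units that actually received the baseline decision.
    return float(np.mean(mu0 + (1 - a) / (1 - prop) * (y - mu0)))
```

A tool's predictions can then be evaluated against this counterfactual target rather than against outcomes observed under the historical policy.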
Counterfactual Evaluation of Machine Learning Models
So I'm sure many of you know Stripe. It's a company that provides a platform for e-commerce. And one of the things that everyone encounters when conducting commerce online is, unsurprisingly, fraud. So before I get into the details of how we address fraud with machine learning, I want to talk a little bit about the fraud life cycle. What typically happens in fraud is that an organized crime ring installs malware on point-of-sale devices. For example, there was this famous breach at Target about five years ago. So you can actually go online, if you go to the deep web, and buy credit card numbers that were taken from personal devices, ATM machines, and so forth. What's kind of surprising and funny is that these criminals who are selling credit card numbers to smaller-time criminals are quite customer-service oriented. So you can say, "I want 12 credit card numbers from Wells Fargo or Citibank. I want credit card numbers that were issued in the zip codes 94102 to 94105," and so forth. Some of them are in fact so customer-service oriented that they guarantee that if you are unable to commit fraud with the cards you buy, they'll give you your money back. Now let's say five years at Stripe was enough for me, and I decided to leave and become a criminal, using all my knowledge.
- Banking & Finance (1.00)
- Law Enforcement & Public Safety > Fraud (0.46)
- Information Technology > Services (0.34)
- Information Technology > Security & Privacy (0.34)
Michael Manapat: Counterfactual evaluation of machine learning models
PyData Seattle 2015. Machine learning models often result in actions: search results are reordered, fraudulent transactions are blocked, etc. But how do you evaluate model performance when you are altering the distribution of outcomes? I'll describe how injecting randomness in production allows you to evaluate current models correctly and generate unbiased training data for new models. Stripe processes billions of dollars in payments a year and uses machine learning to detect and stop fraudulent transactions. Like models used for ad and search ranking, Stripe's models don't just score; they dictate actions that directly change outcomes.
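The talk's core recipe is to let a small random fraction of would-be-blocked transactions through, log the probability of the action taken, and reweight outcomes by the inverse of that probability. A minimal sketch under assumed names and an assumed 5% exploration rate; the details of Stripe's production system are not given in this abstract.

```python
import random

EXPLORE_PROB = 0.05  # assumed exploration rate; Stripe's actual value isn't stated

def decide(score, threshold=0.8):
    """Block high-scoring transactions, but allow a random 5% of them
    through so outcomes above the threshold are still observed.
    Always log the propensity of the action actually taken."""
    if score >= threshold and random.random() > EXPLORE_PROB:
        return {"action": "block", "propensity": 1 - EXPLORE_PROB}
    prop = EXPLORE_PROB if score >= threshold else 1.0
    return {"action": "allow", "propensity": prop}

def estimated_fraud_rate_if_no_model(logs):
    """Estimate the fraud rate had nothing been blocked: reweight each
    *allowed* transaction by 1/propensity (Horvitz-Thompson style), so
    the few allowed transactions above the threshold stand in for all
    the blocked ones. Each log record is assumed to carry a later-
    observed `is_fraud` label (hypothetical field name)."""
    allowed = [rec for rec in logs if rec["action"] == "allow"]
    total = sum(1.0 / rec["propensity"] for rec in allowed)
    fraud = sum(rec["is_fraud"] / rec["propensity"] for rec in allowed)
    return fraud / total
```

The same reweighted logs also serve as unbiased training data for the next model, since the label distribution is no longer censored by the current model's blocks.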